AN IMPROVED VOICE CONVERSION METHOD AND SYSTEM

The present invention relates to a method and to a
system for converting a voice signal that reproduces a
source speaker's voice into a voice signal that has
acoustic characteristics resembling those of a target
speaker's voice.

Sound reproduction is of primary importance in voice
conversion applications such as voice services, oral
man-machine dialogue and voice synthesis from text, and to
obtain acceptable reproduction quality the acoustic
parameters of the voice signals must be closely
controlled.

The main acoustic or prosody parameters modified by 
conventional voice conversion methods are the parameters 
relating to the spectral envelope and, in the case of 
voiced sounds involving vibration of the vocal cords,
the parameters relating to their periodic structure, i.e. 
their fundamental period, the reciprocal of which is 
called the fundamental frequency or pitch. 

Conventional voice conversion methods are 
essentially based on modifications of the spectral 
envelope characteristics and on overall modifications of 
the pitch characteristics. 

A recent study, published on the occasion of the
EUROSPEECH 2003 conference under the title "A new method
for pitch prediction from spectral envelope and its
application in voice conversion" by Taoufik En-Najjary,
Olivier Rosec, and Thierry Chonavel, foresees the
possibility of refining the modification of the pitch
characteristics by defining a function for predicting
those characteristics as a function of spectral envelope
characteristics.

Their approach therefore modifies the spectral 
envelope characteristics and modifies the pitch 
characteristics as a function of the spectral envelope 
characteristics.

However, that method has a serious drawback in that
it makes modification of the pitch characteristics
dependent on modification of the spectral envelope
characteristics. An error in spectral envelope
conversion therefore inevitably affects pitch
prediction.

Moreover, the use of a method of the above kind
requires two major calculation steps, namely modifying
the spectral envelope characteristics and predicting the
pitch, thereby doubling the complexity of the system as a
whole.

The object of the present invention is to solve 
these problems by defining a simple and more effective 
voice conversion method. 

To this end, the present invention consists in a
method of converting a voice signal as spoken by a source
speaker into a converted voice signal whose acoustic
characteristics resemble those of a target speaker, the
method comprising:

• a determination step of determining a function for
transforming acoustic characteristics of the source
speaker into acoustic characteristics similar to those of
the target speaker on the basis of samples of the voices
of the source and target speakers, and

• a transformation step of transforming acoustic
characteristics of the source speaker voice signal to be
converted by applying said transformation function,

which method is characterized in that said
determination step comprises a step of determining a
function for conjoint transformation of characteristics
of the source speaker relating to the spectral envelope
and of characteristics of the source speaker relating to
the pitch, and said transformation step comprises applying
said conjoint transformation function.

The method of the invention therefore modifies the
spectral envelope characteristics and the pitch
characteristics simultaneously in a single operation
without making them interdependent.

According to other features of the invention:

• said step of determining a conjoint transformation
function comprises:

    • a step of analyzing source and target speaker
    voice samples grouped into frames to obtain for each
    frame information relating to the spectral envelope and
    to the pitch,

    • a step of concatenating information relating
    to the spectral envelope and information relating to the
    pitch for each of the source and target speakers,

    • a step of determining a model representing
    common acoustic characteristics of source speaker and
    target speaker voice samples, and

    • a step of determining said conjoint
    transformation function from said model and the voice
    samples;

• said steps of analyzing the source and target
speaker voice samples are adapted to produce said
information relating to the spectral envelope in the form
of cepstral coefficients;

• said analysis steps each comprise a step of modeling
the voice samples as the sum of a harmonic signal and a
noise signal, each modeling step comprising:

    • a substep of estimating the pitch of the
    voice samples,

    • a substep of pitch-synchronized analysis
    of each sample frame, and

    • a substep of estimating spectral envelope
    parameters of each sample frame;

• said step of determining a model determines a
Gaussian probability density mixture model;

• said step of determining a model comprises:

    • a substep of determining a model
    corresponding to a mixture of Gaussian probability
    densities, and

    • a substep of estimating parameters of the
    mixture of Gaussian probability densities by maximum
    likelihood estimation between the acoustic
    characteristics of the source and target speaker samples
    and the model;

• said step of determining a transformation function
further includes a step of normalizing the pitch of the
frames of respective source and target speaker samples
relative to average values of the pitch of the respective
analyzed source and target speaker samples;

• the method includes a step of temporally aligning
the acoustic characteristics of the source speaker with
the acoustic characteristics of the target speaker, this
step being executed before said step of determining a
model;

• the method includes a step of separating voiced
frames and non-voiced frames in the source speaker and
target speaker voice samples, said step of determining a
conjoint transformation function of the characteristics
relating to the spectral envelope and to the pitch being
based entirely on said voiced frames and the method
including a step of determining a function for
transformation of only the spectral envelope
characteristics on the basis only of said non-voiced
frames;

• said step of determining a transformation function
comprises only said step of determining a conjoint
transformation function;

• said step of determining a conjoint transformation
function is based on an estimate of the acoustic
characteristics of the target speaker, the acoustic
characteristics of the source speaker being known;

• said estimate is the conditional expectation of
the acoustic characteristics of the target speaker
given the acoustic characteristics of the source
speaker;

• said step of transforming acoustic characteristics
of the voice signal to be converted comprises:

    • a step of analyzing said voice signal,
    grouped into frames, to obtain for each frame information
    relating to the spectral envelope and to the pitch,

    • a step of formatting the acoustic information
    relating to the spectral envelope and to the pitch of the
    voice signal to be converted, and

    • a step of transforming the formatted acoustic
    information of the voice signal to be converted using
    said conjoint transformation function;

• the method includes a step of separating voiced
frames and non-voiced frames in said voice signal to be
converted, said transformation step comprising:

    • a substep of applying said conjoint
    transformation function only to voiced frames of said
    signal to be converted, and

    • a substep of applying said transformation
    function of the spectral envelope characteristics only to
    non-voiced frames of said signal to be converted;

• said transformation step comprises applying said
conjoint transformation function to the acoustic
characteristics of all the frames of said voice signal to
be converted;

• the method further includes a step of synthesizing
a converted voice signal from said transformed acoustic
information.

The invention also consists in a system for
converting a voice signal as spoken by a source speaker
into a converted voice signal whose acoustic
characteristics resemble those of a target speaker, the
system comprising:

• means for determining a function for transforming
acoustic characteristics of the source speaker into
acoustic characteristics close to those of the target
speaker on the basis of voice samples as spoken by the
source and target speakers, and

• means for transforming acoustic characteristics of
the source speaker voice signal to be converted by
applying said transformation function,

which system is characterized in that said means
for determining a transformation function comprise a unit
for determining a function for conjoint transformation of
characteristics of the source speaker relating to the
spectral envelope and of characteristics of the source
speaker relating to the pitch, and said transformation
means include means for applying said conjoint
transformation function.

According to other features of the above system: 

• it further includes: 

• means for analyzing the voice signal to be 
converted, adapted to output information relating to the 
spectral envelope and to the pitch of the voice signal to 
be converted, and 

• synthesizer means for forming a converted 
voice signal from at least said spectral envelope and 
pitch information transformed simultaneously; and 

• said means for determining an acoustic 
characteristic transformation function further include a 
unit for determining a transformation function for the 
spectral envelope of non-voiced frames, said unit for 
determining the conjoint transformation function being 
adapted to determine the conjoint transformation function 
only for voiced frames. 

The invention can be better understood after reading 
the following description, which is given by way of 
example only and with reference to the appended drawings, 
in which: 

• Figures 1A and 1B together form a general
flowchart of a first embodiment of the method according
to the invention;

• Figures 2A and 2B together form a general
flowchart of a second embodiment of the method according
to the invention;

• Figure 3 is a graph showing experimental
measurements of performance of the method according to
the invention; and

• Figure 4 is a block diagram of a system
implementing a method according to the invention.

Voice conversion consists in modifying a voice
signal reproducing the voice of a reference speaker,
called the source speaker, so that the converted signal
appears to reproduce the voice of another speaker, called
the target speaker.

A method of this kind begins by determining
functions for converting acoustic or prosody
characteristics of the voice signals for the source
speaker into acoustic characteristics close to those of
the voice signals for the target speaker on the basis of
voice samples as spoken by the source speaker and the
target speaker.

The conversion function determination step 1 is more
particularly based on databases of voice samples
corresponding to the acoustic production of the same
phonetic sequences as spoken by the source and target
speakers.

This process, which is often referred to as
"training", is designated by the general reference number
1 in Figure 1A.

The method then uses the function(s) that have been
determined to convert the acoustic characteristics of a
voice signal to be converted as spoken by the source
speaker. In Figure 1B this conversion process is
designated by the general reference number 2.

The method starts with steps 4X and 4Y that analyze
voice samples as spoken by the source and target
speakers, respectively. The samples are grouped into
frames in these steps in order to obtain spectral
envelope information and pitch information for each
frame.

In the present embodiment, the analysis steps 4X and
4Y use a sound signal model formed by the sum of a
harmonic signal and a noise signal, usually called the
harmonic plus noise model (HNM).

The harmonic plus noise model models each voice
signal frame as a harmonic portion representing the
periodic component of the signal, consisting of a sum of
L harmonic sinusoids of amplitude A_l and phase \phi_l, and a
noise portion representing friction noise and the
variation of glottal excitation.

We may therefore write:

    s(n) = h(n) + b(n)

where h(n) is the harmonic portion and b(n) the noise
portion, with:

    h(n) = \sum_{l=1}^{L} A_l(n) \cos(\phi_l(n))

The term h(n) therefore represents the harmonic
approximation of the signal s(n).

The present embodiment is based on representing the 
spectral envelope by means of a discrete cepstrum. 

The steps 4X and 4Y include substeps 8X and 8Y that 
estimate the pitch for each frame, for example using an 
autocorrelation method. 
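
As an illustration only, the autocorrelation approach mentioned above can be sketched as follows. The function name `estimate_pitch_autocorr`, the 60 Hz to 400 Hz search range and the synthetic test tone are assumptions made for this sketch and are not taken from the described embodiment.

```python
import numpy as np

def estimate_pitch_autocorr(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Return an estimated pitch in Hz for one frame (hypothetical helper)."""
    frame = np.asarray(frame, dtype=float)
    frame = frame - frame.mean()
    # Full autocorrelation; keep the non-negative lags only.
    acf = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f_max)
    lag_max = min(int(sample_rate / f_min), len(acf) - 1)
    # The lag with the strongest self-similarity is taken as the period.
    best_lag = lag_min + int(np.argmax(acf[lag_min:lag_max + 1]))
    return sample_rate / best_lag

sample_rate = 8000
n = np.arange(320)                                   # one 40 ms frame
frame = np.sin(2.0 * np.pi * 100.0 * n / sample_rate)  # synthetic 100 Hz tone
pitch_hz = estimate_pitch_autocorr(frame, sample_rate)
```

A practical estimator would also need a voicing decision, since an autocorrelation peak can be found even in frames that have no periodic structure.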

The substeps 8X and 8Y are followed by substeps 10X
and 10Y of pitch synchronized analysis of each frame in
order to estimate the parameters of the harmonic portion
of the signal and the parameters of the noise, in
particular the maximum voicing frequency. Alternatively,
this frequency may be fixed arbitrarily or estimated by
other means known in the art.

In the present embodiment, this synchronized
analysis determines the parameters of the harmonics by
minimizing a weighted least squares criterion between the
complete signal and its harmonic decomposition, the
difference between them corresponding in the present
embodiment to the estimated noise signal. The criterion E
is given by the following equation, in which w(n) is the
analysis window and T_i is the fundamental period of the
current frame:

    E = \sum_{n=-T_i}^{T_i} w^2(n) (s(n) - h(n))^2

where h(n) is the harmonic approximation of the signal s(n).

The analysis window is therefore centered around the 
mark of the fundamental period and its duration is twice 
that period. 
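
The weighted least squares fit over a window of two fundamental periods can be sketched as below. This is a minimal illustration under stated assumptions: the harmonics are taken to have constant amplitude and linear phase within the frame, the analysis window is a Hann window (the text does not name the window), and `fit_harmonics` is a hypothetical helper name.

```python
import numpy as np

def fit_harmonics(s, period, num_harmonics):
    """Weighted least-squares fit of harmonic amplitudes and phases.

    s must hold 2 * period + 1 samples centered on the pitch mark.
    Returns (amplitudes, phases, E), where E is the weighted squared error.
    """
    n = np.arange(-period, period + 1)
    w = np.hanning(len(n))                 # assumed analysis window
    omega0 = 2.0 * np.pi / period          # fundamental, in rad/sample
    l = np.arange(1, num_harmonics + 1)
    # Real basis: a cosine and a sine column for each harmonic.
    basis = np.hstack([np.cos(np.outer(n, l) * omega0),
                       np.sin(np.outer(n, l) * omega0)])
    # Minimizing sum w^2 (s - h)^2 is ordinary least squares on w*s, w*basis.
    coef, *_ = np.linalg.lstsq(basis * w[:, None], s * w, rcond=None)
    a, b = coef[:num_harmonics], coef[num_harmonics:]
    h = basis @ coef
    E = float(np.sum(w ** 2 * (s - h) ** 2))
    # A cos(theta + phi) = A cos(phi) cos(theta) - A sin(phi) sin(theta)
    amplitudes = np.hypot(a, b)
    phases = np.arctan2(-b, a)
    return amplitudes, phases, E

period = 40
n = np.arange(-period, period + 1)
omega0 = 2.0 * np.pi / period
s = 1.0 * np.cos(omega0 * n + 0.3) + 0.5 * np.cos(2 * omega0 * n - 1.0)
amplitudes, phases, E = fit_harmonics(s, period, num_harmonics=2)
```

For this purely harmonic test signal the residual E is numerically zero; on real speech the residual s(n) - h(n) would carry the noise portion.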

Alternatively, these analyses are effected 
asynchronously using a fixed analysis step and a fixed 
window size. 

The analysis steps 4X and 4Y finally include
substeps 12X and 12Y that estimate the parameters of the
spectral envelope of the signals using a regularized
discrete cepstrum method and a Bark scale transformation,
for example, to reproduce the properties of the human ear
as faithfully as possible.

For each frame of rank n of voice signal samples,
the analysis steps 4X and 4Y therefore deliver, for the
voice samples as spoken by the source and target
speakers, respectively, a scalar F_n representing the pitch
and a vector c_n comprising spectral envelope information
in the form of a sequence of cepstral coefficients.

The cepstral coefficients are calculated by a method 
that is known in the art and for this reason is not 
described in detail here. 

The analysis steps 4X and 4Y are advantageously
followed by steps 14X and 14Y that normalize the value of
the pitch of each frame relative to the average pitch of
the source and target speakers, respectively, in order to
replace the pitch value for each voice sample frame with
a pitch value normalized according to the following
formula:

    g = F_{\log} = \log\left( \frac{F_0}{F_0^{avg}} \right)

In the above formula, F_0 is the pitch of the frame and
F_0^{avg} corresponds to the average of the pitch values
over the database analyzed, i.e. over the database of
source speaker voice samples or of target speaker voice
samples, respectively.

For each speaker, this normalization rescales the
variation of the pitch scalar to render it consistent with
the scale of variation of the cepstral coefficients. For
each frame n, g_x(n) is the pitch normalized for the source
speaker and g_y(n) is the pitch normalized for the target
speaker.
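
A minimal sketch of this normalization, assuming the pitch values are given in Hz for voiced frames only. The name `normalize_pitch` and the sample values are made up for illustration.

```python
import numpy as np

def normalize_pitch(f0, f0_avg):
    """Log-ratio normalization: g = log(F0 / F0_avg)."""
    return np.log(np.asarray(f0, dtype=float) / f0_avg)

source_f0 = np.array([110.0, 120.0, 130.0])   # made-up pitch values, in Hz
f0_avg_source = source_f0.mean()              # average over the source database
g_x = normalize_pitch(source_f0, f0_avg_source)
```

A frame whose pitch equals the speaker's average maps to zero, and the values stay within a small range around zero, which is comparable to the range of typical cepstral coefficients.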

The method of the invention then includes steps 16X
and 16Y that concatenate spectral envelope and pitch
information in the form of a single vector for each
source and target speaker.

Thus the step 16X defines for each frame n a vector
x_n grouping together the cepstral coefficients c_x(n) and
the normalized pitch g_x(n) in accordance with the
following equation, in which T denotes the transposition
operator:

    x_n = [ c_x^T(n), g_x(n) ]^T

Similarly, the step 16Y defines for each frame n a
vector y_n grouping together the cepstral coefficients
c_y(n) and the normalized pitch g_y(n) in accordance with
the following equation:

    y_n = [ c_y^T(n), g_y(n) ]^T
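
The concatenation can be pictured with a short sketch that appends the normalized pitch as the last component of each frame's cepstral vector. The helper name `build_feature_vectors` and the numeric values are assumptions for illustration.

```python
import numpy as np

def build_feature_vectors(cepstra, g):
    """Stack cepstral coefficients and normalized pitch into one row per frame."""
    cepstra = np.asarray(cepstra, dtype=float)       # shape (frames, order)
    g = np.asarray(g, dtype=float).reshape(-1, 1)    # shape (frames, 1)
    return np.hstack([cepstra, g])

c_x = np.array([[0.1, 0.2, 0.3],
                [0.4, 0.5, 0.6]])   # two frames, three made-up coefficients
g_x = np.array([-0.05, 0.02])       # made-up normalized pitch values
x = build_feature_vectors(c_x, g_x)
```

The same helper would be applied to the target speaker's cepstra and normalized pitch to obtain the target vectors.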

The steps 16X and 16Y are followed by a step 18 that
aligns the source vector x_n and the target vector y_n to
match these vectors by means of a conventional dynamic
time warping (DTW) algorithm.

Alternatively, the alignment step 18 is implemented
on the basis of only the cepstral coefficients, without
using the pitch information.

The alignment step 18 therefore delivers a pair
vector formed of pairs of cepstral coefficients and pitch
information for the source and target speakers, aligned
temporally.
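
A textbook dynamic time warping alignment can be sketched as follows. It is only an illustration of the general technique: the Euclidean frame distance and the three allowed moves (diagonal, vertical, horizontal) are assumptions, and `dtw_align` is a hypothetical name.

```python
import numpy as np

def dtw_align(x, y):
    """Return index pairs (i, j) aligning rows of x with rows of y."""
    n, m = len(x), len(y)
    # Pairwise Euclidean distances between source and target frames.
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j - 1],
                                                  cost[i - 1, j],
                                                  cost[i, j - 1])
    # Backtrack from the last cell to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = (cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
        move = int(np.argmin(steps))
        if move == 0:
            i, j = i - 1, j - 1
        elif move == 1:
            i -= 1
        else:
            j -= 1
    path.reverse()
    return path

x = np.array([[0.0], [1.0], [2.0]])          # three source frames
y = np.array([[0.0], [0.0], [1.0], [2.0]])   # four target frames
path = dtw_align(x, y)   # [(0, 0), (0, 1), (1, 2), (2, 3)]
```

Each pair in the path gives one aligned source/target frame pair; here the first source frame is matched with the two repeated target frames.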

The alignment step 18 is followed by a step 20 that
determines a model representing acoustic characteristics
common to the source speaker and the target speaker from
the spectral envelope and pitch information for all of
the samples that have been analyzed.

In the present embodiment, this model is a
probabilistic model of the target speaker and source
speaker acoustic characteristics in the form of a
Gaussian mixture model (GMM), i.e. a mixture of Gaussian
probability densities, the parameters of which are
estimated from source and target vectors containing the
normalized pitch and the discrete cepstrum for each
speaker.

In a Gaussian mixture model (GMM) the probability
density p(z) of a random variable z is conventionally
expressed in the following mathematical form:

    p(z) = \sum_{i=1}^{Q} \alpha_i N(z; \mu_i, \Sigma_i)

where:

    \sum_{i=1}^{Q} \alpha_i = 1,  0 \le \alpha_i \le 1

In the above formula, Q denotes the number of
components of the model, N(z; \mu_i, \Sigma_i) is the probability
density of the normal distribution with mean \mu_i and
covariance matrix \Sigma_i, and the coefficients \alpha_i are the
coefficients of the mixture.

The coefficient \alpha_i therefore corresponds to the a
priori probability that the random variable z is
generated by the i-th Gaussian component of the mixture.
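
For illustration, the mixture density can be evaluated directly from its definition as a weighted sum of normal densities. The helper names and the one-dimensional toy parameters below are assumptions chosen so that the result is easy to check by hand.

```python
import numpy as np

def gaussian_pdf(z, mean, cov):
    """Multivariate normal density N(z; mean, cov) for a single vector z."""
    z = np.atleast_1d(z).astype(float)
    mean = np.atleast_1d(mean).astype(float)
    cov = np.atleast_2d(cov).astype(float)
    d = mean.size
    diff = z - mean
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(expo) / norm)

def gmm_density(z, alphas, means, covs):
    """p(z) = sum over components of alpha_i * N(z; mu_i, Sigma_i)."""
    return sum(a * gaussian_pdf(z, m, c)
               for a, m, c in zip(alphas, means, covs))

alphas = [0.5, 0.5]                 # mixture coefficients, summing to 1
means = [[-1.0], [1.0]]             # toy one-dimensional means
covs = [[[1.0]], [[1.0]]]           # toy unit variances
p0 = gmm_density([0.0], alphas, means, covs)
```

With these symmetric toy parameters, p0 equals exp(-0.5) / sqrt(2 * pi), the value of a unit-variance normal density one standard deviation from its mean.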

The step 20 that determines the model more
particularly includes a substep 22 that models the
conjoint density p(z) of the source vector x and the
target vector y such that:

    z_n = [ x_n^T, y_n^T ]^T

The step 20 then includes a substep 24 that
estimates the GMM parameters (\alpha, \mu, \Sigma) of the density
p(z), for example using a conventional algorithm of the
Expectation-Maximization (EM) type, which is an
iterative method of obtaining a maximum likelihood
estimate of the parameters of the Gaussian mixture model
from the data of the voice samples.

The initial GMM parameters are determined using a 
conventional vector quantizing technique. 
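
One iteration of a standard EM update for a full-covariance Gaussian mixture can be sketched as below. This is a generic illustration, not the embodiment's implementation: the initial parameters are simply passed in by the caller (standing in for a vector quantization initialization), the small ridge term `reg` is an added assumption for numerical stability, and the data are synthetic.

```python
import numpy as np

def gaussian_pdf(z, mean, cov):
    """Multivariate normal density evaluated for each row of z."""
    d = z.shape[1]
    diff = z - mean
    solved = np.linalg.solve(cov, diff.T).T
    expo = -0.5 * np.sum(diff * solved, axis=1)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(expo) / norm

def em_step(z, alphas, means, covs, reg=1e-6):
    """One EM iteration; also returns the log-likelihood of the data
    under the parameters that were passed in."""
    num_frames, d = z.shape
    Q = len(alphas)
    # E-step: posterior probability of each component for each frame.
    weighted = np.stack([alphas[i] * gaussian_pdf(z, means[i], covs[i])
                         for i in range(Q)], axis=1)
    total = weighted.sum(axis=1, keepdims=True)
    resp = weighted / total
    log_lik = float(np.sum(np.log(total)))
    # M-step: re-estimate weights, means and covariances.
    counts = resp.sum(axis=0)
    new_alphas = counts / num_frames
    new_means = (resp.T @ z) / counts[:, None]
    new_covs = []
    for i in range(Q):
        diff = z - new_means[i]
        cov = (resp[:, i, None] * diff).T @ diff / counts[i]
        new_covs.append(cov + reg * np.eye(d))   # small ridge (assumption)
    return new_alphas, new_means, np.array(new_covs), log_lik

rng = np.random.default_rng(0)
z = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])   # synthetic joint vectors
alphas = np.array([0.5, 0.5])
means = np.array([[-1.0, -1.0], [1.0, 1.0]])
covs = np.array([np.eye(2), np.eye(2)])
alphas, means, covs, ll_before = em_step(z, alphas, means, covs)
alphas, means, covs, ll_after = em_step(z, alphas, means, covs)
```

In practice the step would be repeated until the log-likelihood stops improving, with each row of z being a joint source/target vector.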

The step 20 that determines the model therefore
delivers the parameters of a Gaussian probability density
mixture representing common acoustic characteristics of
the source speaker and target speaker voice samples, in
particular their spectral envelope and pitch
characteristics.

The method then includes a step 30 that determines
from the model and the voice samples a conjoint function
that transforms the pitch and the spectral envelope, the
latter being represented by the cepstrum, from the source
speaker to the target speaker.

This transformation function is determined from an 
estimate of the acoustic characteristics of the target 
speaker produced from the acoustic characteristics of the 
source speaker, taking the form in the present embodiment 
of the conditional expectation. 

To this end, the step 30 includes a substep 32 that
determines the conditional expectation of the acoustic
characteristics of the target speaker given the acoustic
characteristics information for the source speaker. The
conditional expectation F(x) is determined from the
following formulas:

    F(x) = E[y | x] = \sum_{i=1}^{Q} h_i(x) \left[ \mu_i^y + \Sigma_i^{yx} (\Sigma_i^{xx})^{-1} (x - \mu_i^x) \right]

where:

    h_i(x) = \frac{\alpha_i N(x; \mu_i^x, \Sigma_i^{xx})}{\sum_{j=1}^{Q} \alpha_j N(x; \mu_j^x, \Sigma_j^{xx})}

where:

    \Sigma_i = \begin{bmatrix} \Sigma_i^{xx} & \Sigma_i^{xy} \\ \Sigma_i^{yx} & \Sigma_i^{yy} \end{bmatrix}
    and
    \mu_i = \begin{bmatrix} \mu_i^x \\ \mu_i^y \end{bmatrix}

In the above equations, h_i(x) is the a posteriori
probability that the source vector x is generated by the
i-th component of the Gaussian mixture model.
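
The conditional expectation of the target vector given a source vector under a joint Gaussian mixture can be sketched as follows. The function names, the argument layout (joint mean and covariance per component, with the source block first) and the toy parameters are assumptions made for this illustration.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Multivariate normal density N(x; mean, cov) for a single vector x."""
    d = len(mean)
    diff = x - mean
    expo = -0.5 * diff @ np.linalg.solve(cov, diff)
    return np.exp(expo) / np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))

def convert(x, alphas, means, covs, dim_x):
    """F(x) = E[y | x] under a joint GMM over z = [x, y]."""
    Q = len(alphas)
    weights = np.empty(Q)
    partial = []
    for i in range(Q):
        mu_x, mu_y = means[i][:dim_x], means[i][dim_x:]
        s_xx = covs[i][:dim_x, :dim_x]     # source block of the covariance
        s_yx = covs[i][dim_x:, :dim_x]     # cross block (target, source)
        weights[i] = alphas[i] * gaussian_pdf(x, mu_x, s_xx)
        # Per-component regression: mu_y + S_yx S_xx^{-1} (x - mu_x)
        partial.append(mu_y + s_yx @ np.linalg.solve(s_xx, x - mu_x))
    h = weights / weights.sum()            # a posteriori probabilities h_i(x)
    return h @ np.array(partial)

# Toy single-component model with one source and one target dimension.
alphas = np.array([1.0])
means = np.array([[0.0, 1.0]])
covs = np.array([[[1.0, 0.5],
                  [0.5, 1.0]]])
y_hat = convert(np.array([2.0]), alphas, means, covs, dim_x=1)
```

With a single component the function reduces to a linear regression: here y_hat is 1 + 0.5 * (2 - 0) = 2. With several components the per-component regressions are blended by the posterior probabilities.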

Determining the conditional expectation therefore
yields the function for conjoint transformation of the
spectral envelope and pitch characteristics between the
source speaker and the target speaker.

It is therefore apparent that, from the model and
the voice samples, the analysis method of the invention
yields a function for conjoint transformation of the
pitch and spectral envelope acoustic characteristics.

Referring to Figure 1B, the conversion method then
includes the step 2 of transforming a voice signal to be
converted, as spoken by the source speaker, which may be
different from the voice signals used above.

This transformation step 2 starts with an analysis
step 36 which, in the present embodiment, effects an HNM
breakdown similar to those effected in the steps 4X and
4Y described above. This step 36 delivers spectral
envelope information in the form of cepstral
coefficients, pitch information and maximum voicing
frequency and phase information.

The step 36 is followed by a step 38 that formats
the acoustic characteristics of the signal to be
converted by normalization of the pitch and concatenation
with the cepstral coefficients in order to form a single
vector.

That single vector is used in a step 40 that
transforms the acoustic characteristics of the voice
signal to be converted by applying the transformation
function determined in the step 30 to the cepstral
coefficients of the signal to be converted defined in the
step 36 and to the pitch information.

Thus after the step 40, each frame of source speaker
samples of the signal to be converted is associated with
simultaneously transformed spectral envelope and pitch
information whose characteristics are similar to
those of the target speaker samples.

The method then includes a step 42 that denormalizes
the transformed pitch information.

This step 42 returns the transformed pitch
information to a scale appropriate to the target speaker,
in accordance with the following equation:

    F_0[F(x)] = F_0^{avg}(y) \cdot e^{F[g_x(n)]}

In the above equation F_0[F(x)] is the denormalized
transformed pitch, F_0^{avg}(y) is the average of the values of
the pitch of the target speaker, and F[g_x(n)] is the
transform of the normalized pitch of the source speaker.
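
Denormalization inverts the logarithmic normalization using the target speaker's average pitch. A minimal sketch, with a hypothetical helper name and made-up values:

```python
import numpy as np

def denormalize_pitch(g_converted, f0_avg_target):
    """Map a converted normalized pitch back to Hz: F0 = F0_avg(y) * exp(g)."""
    return f0_avg_target * np.exp(np.asarray(g_converted, dtype=float))

f0_avg_target = 200.0                        # made-up target average, in Hz
g_converted = np.array([0.0, np.log(1.1)])   # made-up converted values
f0_converted = denormalize_pitch(g_converted, f0_avg_target)
```

A converted value of zero maps to the target average (200 Hz here), and a value of log(1.1) maps to a pitch 10% above it.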

The conversion method then includes a conventional
step 44 that synthesizes the output signal, in the
present example by an HNM type synthesis that delivers
directly the voice signal converted from the transformed
spectral envelope and pitch information produced by the
step 40 and the maximum voicing frequency and phase
information produced by the step 36.

The voice conversion method using the analysis
method of the invention therefore yields a voice
conversion that jointly achieves spectral envelope and
pitch modifications to obtain sound reproduction of good
quality.

A second embodiment of the method according to the
invention is described next with reference to the general
flowchart shown in Figure 2A.

As above, this embodiment of the method
includes the determination 1 of functions for
transforming acoustic characteristics of the source
speaker into acoustic characteristics close to those of
the target speaker.

This determination step 1 starts with the execution
of the steps 4X and 4Y of analyzing voice samples as
spoken by the source speaker and the target speaker,
respectively.



These steps 4X and 4Y use the harmonic plus noise 
model (HNM) described above and each produces a scalar 
F(n) representing the pitch and a vector c(n) comprising 
spectral envelope information in the form of a sequence 
of cepstral coefficients. 

In this embodiment, these analysis steps 4X and 4Y 
are followed by a step 50 of aligning the cepstral 
coefficient vectors obtained by analyzing the source 
speaker and target speaker frames. 

This step 50 is executed by an algorithm such as the 
DTW algorithm, in a similar manner to the step 18 of the 
first embodiment. 

After the alignment step 50, a pair vector is
available formed of pairs of cepstral coefficients for
the source speaker and the target speaker, aligned
temporally. This pair vector is also associated with the
pitch information.

The alignment step 50 is followed by a separation 
step 54 in which voiced frames and non-voiced frames in 
the pair vector are separated. 

Only the voiced frames have a pitch and the frames 
can be sorted by considering whether pitch information 
exists for each pair of the pair vector. 
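
The sorting of aligned pairs can be pictured as below. The sketch assumes that a missing pitch is encoded as 0, that a pair counts as voiced only when both frames have a pitch, and that mixed pairs are set aside; none of these conventions is stated in the text, and `split_voiced` is a hypothetical name.

```python
import numpy as np

def split_voiced(f0_source, f0_target):
    """Return indices of voiced pairs and of non-voiced pairs."""
    f0_source = np.asarray(f0_source, dtype=float)
    f0_target = np.asarray(f0_target, dtype=float)
    has_x = f0_source > 0          # assumed encoding: 0 means no pitch
    has_y = f0_target > 0
    voiced = np.flatnonzero(has_x & has_y)
    unvoiced = np.flatnonzero(~has_x & ~has_y)
    return voiced, unvoiced        # mixed pairs are dropped (assumption)

f0_x = [110.0, 0.0, 120.0, 0.0]    # made-up aligned source pitch values
f0_y = [210.0, 0.0, 0.0, 0.0]      # made-up aligned target pitch values
voiced_idx, unvoiced_idx = split_voiced(f0_x, f0_y)
```

In this toy input, pair 0 is voiced, pairs 1 and 3 are non-voiced, and pair 2 is set aside because only the source frame carries a pitch.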

This separation step 54 enables the subsequent step 
56 of determining a function for conjoint transformation 
of the spectral envelope and pitch characteristics of 
voiced frames and the subsequent step 58 of determining a 
function for transformation of only the spectral envelope 
characteristics of non-voiced frames. 

The step 56 of determining a transformation function 
for voiced frames starts with steps 60X and 60Y of 
normalizing the pitch information for the source and 
target speakers, respectively. 

These steps 60X and 60Y are executed in a similar
way to the steps 14X and 14Y of the first embodiment and,
for each voiced frame, produce the normalized frequencies
g_x(n) for the source speaker and g_y(n) for the target
speaker.

These normalization steps 60X and 60Y are followed
by steps 62X and 62Y that concatenate the cepstral
coefficients c_x and c_y for the source speaker and the
target speaker, respectively, with the normalized
frequencies g_x and g_y.

These concatenation steps 62X and 62Y are executed
in a similar way to the steps 16X and 16Y and produce a
vector x_n containing spectral envelope and normalized
pitch information for voiced frames from the source
speaker and a vector y_n containing spectral envelope and
normalized pitch information for voiced frames from the
target speaker.

In addition, the alignment between these two vectors
obtained at the end of the step 50 is preserved, the
modifications made during the normalization steps 60X and
60Y and the concatenation steps 62X and 62Y being
effected directly on the vector output by the
alignment step 50.

The method next includes a step 70 of determining a 
model representing the common characteristics of the 
source speaker and the target speaker. 

Differing in this respect from the step 20 described
with reference to Figure 1A, this step 70 uses pitch and
spectral envelope information of only the analyzed voiced
samples.

In this embodiment, this step 70 is based on a
probabilistic model according to a Gaussian mixture model
(GMM).

Thus the step 70 includes a substep 72 of modeling
the conjoint density for the vectors x and y executed in
a similar way to the substep 22 described above.

This substep 72 is followed by a substep 74 for
estimating the GMM parameters (\alpha, \mu, \Sigma) of the density
p(z).

As in the embodiment described above, this estimate
is obtained using an EM-type algorithm that yields a
maximum likelihood estimate of the parameters of the
Gaussian mixture model from the voice sample data.

The step 70 therefore delivers the parameters of a
Gaussian probability density mixture representing the
common spectral envelope and pitch acoustic
characteristics of the voiced source speaker and target
speaker voice samples.

The step 70 is followed by a step 80 of determining
a function for conjoint transformation of the pitch and
the spectral envelope of the voiced voice samples from
the source speaker to the target speaker.

This step 80 is carried out in a similar way to the
step 30 of the first embodiment and in particular
includes a substep 82 of determining the conditional
expectation of the acoustic characteristics of the target
speaker given the acoustic characteristics of the source
speaker, this substep applying the same formulas as
above to the voiced samples.

The step 80 therefore yields a function for conjoint 
transformation of the spectral envelope and pitch 
characteristics between the source speaker and the target 
speaker that is applicable to the voiced frames. 

A step 58 of determining a transformation function
for the spectral envelope characteristics of only
non-voiced frames is executed in parallel with the step 56
of determining the transformation function for voiced
frames.

In the present embodiment, the determination step 58
includes a step 90 of determining a filter function based
on spectral envelope parameters, based on pairs of
non-voiced frames.

This step 90 is carried out in the conventional way by
determining a Gaussian mixture model or by any other
appropriate technique known in the art.

A function for transformation of the spectral
envelope characteristics of non-voiced frames is obtained
at the end of the determination step 58.

Referring to Figure 2B, the method then includes the
step 2 of transforming the acoustic characteristics of a
voice signal to be converted.

As in the previous embodiment, this transformation 
step 2 begins with a step 36 of analyzing the voice 
signal to be converted using a harmonic plus noise model 
(HNM) and a formatting step 38. 

As stated above, these steps 36 and 38 produce the 
spectral envelope and normalized pitch information in the 
form of a single vector. The step 36 also produces 
maximum voicing frequency and phase information. 

In the present embodiment, the step 38 is followed 
by a step 100 of separating voiced and non-voiced frames 
in the analyzed signal to be converted. 

This separation is based on a criterion founded on 
the presence of non-null pitch information. 

The step 100 is followed by a step 102 of 
transforming the acoustic characteristics of the voice 
signal to be converted by applying the transformation 
functions determined in the steps 80 and 90. 

This step 102 more particularly includes a substep
104 of applying the function for conjoint transformation
of the spectral envelope and pitch information determined
in the step 80 to only the voiced frames separated out in
the step 100.

In parallel, the step 102 includes a substep 106 of 
applying the function for transforming only the spectral 
envelope information determined in the step 90 to only 
the non-voiced frames separated out in the step 100. 

The substep 104 therefore outputs, for each voiced 
sample frame of the source speaker signal to be 
converted, simultaneously transformed spectral envelope 
and pitch information whose characteristics are similar 
to those of the target speaker voiced samples. 



The substep 106 outputs, for each frame of non-voiced 
samples of the source speaker signal to be converted, 
transformed spectral envelope information whose 
characteristics are similar to those of the non-voiced 
target speaker samples. 
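The routing performed by the substeps 104 and 106 can be sketched as below. The two callables stand in for the transformation functions determined in the steps 80 and 90. Their signatures, the frame layout, and the name `transform_frames` are assumptions made for illustration.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Envelope = List[float]
# Hypothetical frame layout: (spectral envelope coefficients, pitch or None).
Frame = Tuple[Envelope, Optional[float]]


def transform_frames(
    frames: Sequence[Frame],
    conjoint_fn: Callable[[Envelope, float], Tuple[Envelope, float]],
    envelope_fn: Callable[[Envelope], Envelope],
) -> List[Frame]:
    """Apply the conjoint function to voiced frames only and the
    envelope-only function to non-voiced frames only."""
    converted: List[Frame] = []
    for envelope, pitch in frames:
        if pitch is not None and pitch != 0.0:
            # Voiced frame: envelope and pitch are transformed together.
            converted.append(conjoint_fn(envelope, pitch))
        else:
            # Non-voiced frame: only the envelope is transformed.
            converted.append((envelope_fn(envelope), None))
    return converted
```

The output keeps the input frame order, so the transformed frames can be passed directly to a synthesis stage.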

In the present embodiment, the method further 
includes a step 108 of de-normalizing the transformed 
pitch information produced by the transformation substep 
104 in a similar manner to the step 42 described with 
reference to Figure 1B. 

The conversion method then includes a step 110 of 
synthesizing the output signal, in the present example by 
means of an HNM type synthesis that delivers the voice 
signal converted on the basis of the transformed spectral 
envelope and pitch information and maximum voicing 
frequency and phase information for voiced frames and on 
the basis of transformed spectral envelope information 
for non-voiced frames. 

This embodiment of the method of the invention 
therefore processes voiced frames and non-voiced frames 
differently, voiced frames undergoing simultaneous 
transformation of the spectral envelope and pitch 
characteristics and non-voiced frames undergoing 
transformation of only the spectral envelope 
characteristics. 

An embodiment of this kind provides more accurate 
transformation than the previous embodiment while keeping 
complexity limited. 

The efficiency of conversion can be assessed from 
identical voice samples as spoken by the source speaker 
and the target speaker. 

Thus the voice signal as spoken by the source 
speaker is converted by the method of the invention and 
the resemblance of the converted signal to the signal as 
spoken by the target speaker is assessed. 

The resemblance is calculated, for example, in the form 
of a ratio between the acoustic distance between the 
converted signal and the target signal and the acoustic 
distance between the target signal and the source signal. 
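The ratio described above can be sketched as follows. The description does not specify the acoustic distance, so this sketch assumes a mean Euclidean distance between time-aligned feature vectors (for instance cepstral vectors); the function names are hypothetical.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def mean_frame_distance(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Mean Euclidean distance between two aligned frame sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("sequences must be non-empty and time-aligned")
    return sum(math.dist(x, y) for x, y in zip(a, b)) / len(a)


def resemblance_ratio(
    converted: Sequence[Vector],
    target: Sequence[Vector],
    source: Sequence[Vector],
) -> float:
    """Distance(converted, target) divided by distance(target, source)."""
    reference = mean_frame_distance(target, source)
    if reference == 0.0:
        raise ValueError("source and target are identical")
    return mean_frame_distance(converted, target) / reference
```

Under this assumed distance, a ratio below 1 indicates that the converted signal is closer to the target than the unconverted source was, and a ratio of 0 indicates a perfect match.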

Figure 3 shows a graph of the results obtained in 
the case of converting a male voice into a female voice, 
the transformation functions being obtained using 
training bases each containing five minutes of speech 
sampled at 16 kHz, the cepstral vectors used being of 
size 20 and the Gaussian mixture model having 64 
components. 

In this graph the frame numbers are plotted on the 
abscissa axis and the signal frequency in Hertz is 
plotted on the ordinate axis. 

The results shown are characteristic of voiced 
frames running from approximately frame 20 to frame 85. 

In this graph, the curve Cx represents the pitch 
characteristics of the source signal and the curve Cy 
represents those of the target signal. 

The curve C1 represents the pitch characteristics of 
a signal obtained by conventional linear conversion. 

It is apparent that this signal has the same general 
shape as the source signal represented by the curve Cx. 

Conversely, the curve C2 represents the pitch 
characteristics of a signal converted by the method of 
the invention as described with reference to Figures 2A 
and 2B. 

It is obvious that the pitch curve of the signal 
converted by the method of the invention has a general 
shape that is very similar to that of the target pitch 
curve Cy. 

Figure 4 is a functional block diagram of a voice 
conversion system using the method described with 
reference to Figures 2A and 2B. 

This system uses input from a database 120 of voice 
samples as spoken by the source speaker and a database 
122 containing at least the same voice samples as spoken 
by the target speaker. 

These two databases are used by a module 124 for 
determining functions for transforming acoustic 
characteristics of the source speaker into acoustic 
characteristics of the target speaker. 

The module 124 is adapted to execute the steps 56 
and 58 of the method described with reference to Figure 2 
and thus can determine a transformation function for the 
spectral envelope of non-voiced frames and a conjoint 
transformation function for the spectral envelope and 
pitch of voiced frames. 

Generally, the module 124 includes a unit 126 for 
determining a function for conjoint transformation of the 
spectral envelope and the pitch of voiced frames and a 
unit 128 for determining a function for transformation of 
the spectral envelope of non-voiced frames. 

The voice conversion system receives at input a 
voice signal 130 to be converted reproducing the speech 
of the source speaker. 

The signal 130 is fed into a signal analyzer module 
132 producing a harmonic plus noise model (HNM) type 
breakdown, for example, to dissociate spectral envelope 
information of the signal 130 in the form of cepstral 
coefficients and pitch information. The module 132 also 
outputs maximum voicing frequency and phase information 
by applying the harmonic plus noise model. 

Thus the module 132 implements the step 36 of the 
method described above and advantageously also implements 
the step 38. 

Optionally, the information produced by this 
analysis may be stored for subsequent use. 

The system also includes a module 134 for separating 
voiced frames and non-voiced frames in the analyzed voice 
signal to be converted. 

Voiced frames separated out by the module 134 are 
forwarded to a transformation module 136 adapted to apply 
the conjoint transformation function determined by the 
unit 126. 

Thus the transformation module 136 implements the 
step 104 described with reference to Figure 2B and 
advantageously also implements the denormalization step 
108. 

Non-voiced frames separated out by the module 134 
are forwarded to a transformation module 138 adapted to 
transform the cepstral coefficients of the non-voiced 
frames. 

The non-voiced frame transformation module 138 
therefore implements the step 106 described with 
reference to Figure 2B. 

The system further includes a synthesizing module 
140 receiving as input, for voiced frames, the conjointly 
transformed spectral envelope and pitch information and 
the maximum voicing frequency and phase information 
produced by the module 136. The module 140 also receives 
the transformed cepstral coefficients for non-voiced 
frames produced by the module 138. 

The module 140 therefore implements the step 110 of 
the method described with reference to Figure 2B and 
delivers a signal 150 corresponding to the voice signal 
130 for the source speaker with its spectral envelope and 
pitch characteristics modified to resemble those of the 
target speaker. 

The system described may be implemented in various 
ways and in particular using appropriate computer 
programs and sound acquisition hardware. 

In the context of application of the method of the 
invention as described with reference to Figures 1A and 
1B, the system includes, in the form of the module 124, a 
single unit for determining a conjoint spectral envelope 
and pitch transformation function. 

In such an embodiment, the separation module 134 and 
the non-voiced frame transformation function application 
module 138 are not needed. 

The module 136 therefore is able to apply only the 
conjoint transformation function to all the frames of the 
voice signal to be converted and to deliver the 
transformed frames to the synthesizing module 140. 

Generally, the system is adapted to implement all 
the steps of the methods described with reference to 
Figures 1 and 2. 

In all cases, the system can also be applied to 
particular databases to form databases comprising 
converted signals that are ready to use. 

For example, the analysis is performed offline and 
the HNM analysis parameters are stored for subsequent use 
in the step 40 or 100 by the module 134. 

Finally, depending on the complexity of the signals 
and the quality required, the method and the system of 
the invention may operate in real time. 

Embodiments other than those described may be 
envisaged, of course. 

In particular, the HNM and GMM type models may be 
replaced by other techniques and models known to the 
person skilled in the art. For example, the analysis may 
use linear predictive coding (LPC) techniques and 
sinusoidal or multiband excited (MBE) models and the 
spectral parameters may be line spectrum frequency (LSF) 
parameters or parameters linked to formants or to a 
glottal signal. Alternatively, vector quantization 
(Fuzzy VQ) may replace the Gaussian mixture model. 

Alternatively, the estimate used in the step 30 may 
be a maximum a posteriori (MAP) criterion corresponding 
to calculating the expectation only for the model that 
best represents the source-target pair. 
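The difference between the full expectation and the MAP variant can be illustrated with a scalar sketch. It assumes that the conversion is a posterior-weighted sum of per-component linear predictions over a Gaussian mixture, reduced here to one dimension for readability. The dictionary keys (`w`, `mx`, `my`, `vx`, `cyx`) and function names are hypothetical.

```python
import math
from typing import Dict, List

Component = Dict[str, float]  # keys: w, mx, my, vx, cyx (assumed names)


def posteriors(x: float, comps: List[Component]) -> List[float]:
    """Posterior probability of each mixture component given source x."""
    likes = [
        c["w"] * math.exp(-0.5 * (x - c["mx"]) ** 2 / c["vx"])
        / math.sqrt(2.0 * math.pi * c["vx"])
        for c in comps
    ]
    total = sum(likes)
    return [like / total for like in likes]


def component_prediction(x: float, c: Component) -> float:
    """Conditional mean of the target given x for a single component."""
    return c["my"] + c["cyx"] / c["vx"] * (x - c["mx"])


def convert_expectation(x: float, comps: List[Component]) -> float:
    """Expectation over all components, weighted by their posteriors."""
    h = posteriors(x, comps)
    return sum(hi * component_prediction(x, c) for hi, c in zip(h, comps))


def convert_map(x: float, comps: List[Component]) -> float:
    """MAP variant: use only the component with the highest posterior."""
    h = posteriors(x, comps)
    best = max(range(len(comps)), key=h.__getitem__)
    return component_prediction(x, comps[best])
```

The MAP variant avoids summing over all components, at the cost of possible discontinuities when the best component changes from one frame to the next.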

In another variant, a conjoint transformation 
function is determined using a least squares technique 
instead of the conjoint density estimation technique 
described here. 

In that variant, determining a transformation 
function includes modeling the probability density of the 
source vectors using a Gaussian mixture model and then 
determining the parameters of the model using an 
Expectation-Maximization (EM) algorithm. The modeling 
then takes account of source speaker speech segments 
for which counterparts as spoken by the target speaker 
are not available. 

The determination process then obtains the 
transformation function by minimizing a least squares 
criterion between the target and source parameters. It 
should be noted that the estimate of this function is 
always expressed in the same way but that the parameters 
are estimated differently and additional data is taken 
into account. 
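One common way of writing such a least squares criterion is sketched below. The notation is an assumption introduced for illustration and is not taken from this description: x_n and y_n denote aligned source and target vectors, alpha_i, mu_i and Sigma_i the weights, means and covariances of the source Gaussian mixture model, and nu_i and Gamma_i the parameters to be estimated.

```latex
\[
F(x) = \sum_{i=1}^{M} h_i(x)\,\bigl[\nu_i + \Gamma_i \Sigma_i^{-1}(x - \mu_i)\bigr],
\qquad
h_i(x) = \frac{\alpha_i\,\mathcal{N}(x;\mu_i,\Sigma_i)}
              {\sum_{j=1}^{M} \alpha_j\,\mathcal{N}(x;\mu_j,\Sigma_j)}
\]
\[
(\hat{\nu},\hat{\Gamma}) = \arg\min_{\nu,\Gamma}
  \sum_{n=1}^{N} \bigl\lVert y_n - F(x_n) \bigr\rVert^{2}
\]
```

In this form the function F keeps the same expression as a posterior-weighted sum of linear terms, while the source model can be trained on all available source speech and only nu_i and Gamma_i depend on the aligned source-target pairs.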



